1. Introduction

Power consumption has quickly become a key design constraint in microprocessor designs,
from low-end embedded processors to high-end, high-performance systems. The embedded
processors found in PDAs and cell phones must use energy-efficient designs, as their energy
payload is limited by form factor and weight constraints. With battery power density improving
only at a rate of about 5% per year, increases in battery lifetime must come about through
improvements in the energy efficiency of system components. To create power-sensitive designs,
accurate power estimation combined with architectural or system-level performance simulation is
a key design tool that permits rapid early design studies that gauge trade-offs between
performance and power.

Recently, several microarchitectural-level power estimation tools have been introduced
[1, 2, 3] in academia, and they have been widely adopted for use in design studies that require
power modeling. In all of these tools, microprocessor power is estimated by accruing the power
reported by the power models for each access to a microarchitectural functional block. In
Wattch [1], Brooks et al. extended CACTI, an access and cycle time model for on-chip caches [4],
to model the power dissipation of on-chip storage blocks such as caches, register files, and
branch target buffers. The model used in Wattch resorts to a fast approximation that is well
suited for high-end designs containing large and complex memory, but the power consumption of
datapath and execution blocks is estimated by a single per-access value, which does not scale
with technology or with different circuit styles. Although their approach is a good
approximation for the high-end application domain, we believe that embedded designs require more
accurate modeling based on the specific switching activity within each execution block.

In SimplePower [2], Vijaykrishnan et al. incorporated register-transfer level (RTL) power
models based on look-up tables (LUT) into a microarchitectural simulator. Each LUT contains a
set of pre-characterized power dissipations for a datapath component, and each entry of the LUT,
indexed by the Hamming distance between subsequent input vector pairs, returns the estimated
power of the component [5]. The objective of this tool is to provide a framework to quickly
evaluate a range of architectural and algorithmic trade-offs during the early design stages. To
this end, it targets a reference processor design for the pre-computation of capacitance tables.
This reference design, while accurate enough for the purpose of trade-off analysis, is not easily
modifiable to describe specific alternative designs that may have different datapath widths,
smaller feature sizes, or different technologies. On the other hand, an evaluation of power
dissipation in later design stages would obviously benefit from referencing the specific design
under development. These microarchitectural-level power modeling tools have been invaluable in
giving computer architects the insights necessary to develop first-generation microarchitectural
power optimizations. However, a rapidly changing technology landscape combined with increasingly
complex microarchitectural features has brought about an erosion in the fidelity of existing
power models.

In this report, we outline the methodology behind the Sim-Panalyzer program. It is an
augmentation to the SimpleScalar performance simulator that allows the user to estimate power
consumption. It is broken out into several components that model distinct parts of a computer:
cache power models; datapath and execution unit power models; clock tree power models; and I/O
power models. These power models can be configured into an augmented SimpleScalar simulator
that will then produce power consumption figures.

There are a number of artifacts in SimpleScalar that can distort the event counting needed to
calculate dynamic power. These artifacts are discussed in [27]. Our power analyzer program,
Sim-Panalyzer, accounts for these.
The rest of the report is divided as follows. Section 2 details the infrastructure and explains
how to use Sim-Panalyzer. Sections 3 through 5 provide some background for our approach to
modeling MOSFETs, interconnect, and circuits in general. Sections 6 through 9 apply these
modeling techniques to cache power, datapath and execution unit power, clock tree power, and I/O
power, respectively. Section 10 adds some concluding remarks. The appendix on Sim-iPAQ is
included because it represents a typical low power platform, and it was developed as part of the
same project that supported the development of Sim-Panalyzer. Finally, the Publications section
lists papers that were written with partial support of this project.
2. Infrastructure for Microarchitectural Power Simulation

Sim-Panalyzer is an infrastructure for microarchitectural power simulation. It is implemented
on top of “sim-outorder”, a component within the SimpleScalar [6] simulator. To minimize the
modification of the original “sim-outorder.c”, we implemented minimal interfaces to gather
microarchitectural activities such as cache accesses. Originally, this project targeted the ARM
instruction set architecture (ISA), in which there are no complex microarchitectural blocks such
as an instruction queue (IQ), re-order buffer (ROB), branch predictor, floating-point unit, etc.
However, we now also provide a platform for the Alpha ISA. Our main focus is on basic
microarchitectural blocks and major power dissipation sources such as clock distribution trees,
external I/O, on-chip memories, and execution blocks.

The user must specify an effective switching capacitance per access, which is used to compute
the energy dissipation of each microarchitectural block. Sim-Panalyzer computes the energy
dissipation from the switching capacitance multiplied by the number of microarchitectural
accesses. A different scheme is applied for external I/O accesses; we provide a more detailed
transaction model to count I/O pin switches in a cycle-accurate way. We also provide various
routines and parameters to guide the user in estimating the effective switching capacitance per
access of each block. Those routines are technology scalable. Thus, we provide usage notes and
examples on how to port different technologies into our infrastructure.
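The per-block bookkeeping described above can be sketched as follows. This is only an illustration, not Sim-Panalyzer's own code: the type `block_power_t` and the functions `block_access`, `block_energy_pj`, and `block_report` are hypothetical names. The sketch also assumes that each access dissipates C_eff × V², with capacitance in pF and voltage in V, so that energy comes out in pJ.

```c
#include <stdio.h>

/* Hypothetical per-block record; the names are illustrative only. */
typedef struct {
    const char *name;        /* block label, e.g. "IL1 Cache" */
    double ceff_pf;          /* user-specified effective switching capacitance per access (pF) */
    unsigned long accesses;  /* number of accesses counted during simulation */
} block_power_t;

/* Record one microarchitectural access to the block. */
static void block_access(block_power_t *b) { b->accesses++; }

/* Accumulated energy in pJ, assuming E = accesses * Ceff * Vdd^2. */
static double block_energy_pj(const block_power_t *b, double vdd) {
    return (double)b->accesses * b->ceff_pf * vdd * vdd;
}

/* Print a one-line report for the block. */
static void block_report(const block_power_t *b, double vdd) {
    printf("%s: %lu accesses, %.3f pJ\n", b->name, b->accesses, block_energy_pj(b, vdd));
}
```

A simulator loop would call `block_access` each time the corresponding structure is touched and `block_report` once at the end of the run.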

In addition to these features, we provide a power modeling methodology and library to support
more sophisticated and accurate power models. In the library, we provide basic building blocks
for the embedded logic simulator and switching capacitance extraction for CMOS gates. The logic
simulator collects the number of switchings in each internal node of the target circuit or
functional block, and the capacitance extractor estimates the switching capacitance of each node.
This library supports hierarchical implementations of functional blocks. Thus, users can reuse
previously implemented sub-blocks to build more complex functional blocks.

2.1  Where  to  Get  the  Source  Code 

Source code can be downloaded from our website. Go to http://www.eecs.umich.edu/~panalyzer
and click on the link to Sim-Panalyzer 2.0. The source code, “sim-panalyzer-2.0.tar.gz”, is a
compressed tar ball file.

The  tar  ball  version  has  been  created  in  a  Linux  x86  environment.  We  have  not  tested  our 
code  for  other  operating  systems  and  target  machines. 

2.2  How  to  Compile 

Untar “sim-panalyzer-2.0.tar.gz” into your install directory. Sim-Panalyzer has currently
been compiled using gcc 3.2. Other gcc versions have not been tested thoroughly, so we
recommend that you compile with this version of gcc. We compiled the source code using GNU
make; ‘make sim-panalyzer’ generates a binary for the simulator. Go to the root directory for
each version, ‘./Implementations/targetmachine’, and execute this command. This should generate
the executable file ‘sim-panalyzer’. For simple tests you can execute the small programs under
the “./Implementations/targetmachine/tests” directory. These are provided by the SimpleScalar
toolset.

Sample  tools  used  to  extract  effective  capacitance  for  various  functional  blocks  can  be  built 
by  going  to  the  ‘./pmodel/’  directory  and  executing  make. 

2.3  How  to  Run  the  Simulator 

We tried to decouple the power-related configurations from the architectural configurations.
To simplify use, we have created a separate script file that parses the cmd file. The format of a
cmd file is similar to a Microsoft Windows ini file. We divide the configuration variables into
sections and parse through these sections to generate an appropriate configuration for our
simulator.
[Component]
AIO
DIO
IL1 Cache
DL1 Cache
IL2 Cache
DL2 Cache
ITLB
DTLB
Branch_Predictor
Bimodel
Level1
Level2
BTB
RAS
IRF
FPRF
Random Logic
Clock

[Global]
supply_voltage=1.8
frequency=200

[AIO]
frequency=200
IO_voltage=3.3
numberofbufferstages=5
microstrip length=10
external load=1

[DIO]
frequency=200
IO_voltage=3.3
numberofbufferstages=5
microstrip length=10
external load=1

Figure 1: Example of cmd file
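To make the ini-like layout concrete, the following is a minimal sketch of how a single line of such a file could be classified as a section header, a `name=value` pair, or a bare entry such as a component name. It is not the parser shipped with Sim-Panalyzer (which uses a separate script); `cmd_kind_t`, `trim`, and `parse_cmd_line` are hypothetical names, and the sketch assumes that all values are numeric.

```c
#include <stdio.h>
#include <string.h>

/* Kind of line found in an ini-style cmd file (illustrative names). */
typedef enum { CMD_BLANK, CMD_SECTION, CMD_KEYVAL, CMD_NAME } cmd_kind_t;

/* Remove leading and trailing whitespace in place and return the new start. */
static char *trim(char *s) {
    char *end;
    while (*s == ' ' || *s == '\t') s++;
    end = s + strlen(s);
    while (end > s && (end[-1] == ' ' || end[-1] == '\t' ||
                       end[-1] == '\n' || end[-1] == '\r')) end--;
    *end = '\0';
    return s;
}

/* Classify one line; `line` is modified in place.
 * CMD_SECTION: *key points at the section name with the brackets removed.
 * CMD_KEYVAL:  *key is the parameter name and *val its numeric value.
 * CMD_NAME:    *key is a bare entry (also used when a value is not numeric). */
static cmd_kind_t parse_cmd_line(char *line, char **key, double *val) {
    char *s = trim(line);
    char *eq;
    size_t n = strlen(s);
    if (n == 0) return CMD_BLANK;
    if (s[0] == '[' && s[n - 1] == ']') {
        s[n - 1] = '\0';
        *key = trim(s + 1);
        return CMD_SECTION;
    }
    eq = strchr(s, '=');
    if (eq != NULL) {
        *eq = '\0';                 /* split "name=value" at the '=' sign */
        *key = trim(s);
        if (sscanf(eq + 1, "%lf", val) == 1) return CMD_KEYVAL;
    }
    *key = s;
    return CMD_NAME;
}
```

A caller would remember the most recent CMD_SECTION name and attach every following CMD_KEYVAL or CMD_NAME entry to that section.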

An example of a cmd file is shown in Figure 1. Power configurations are given as follows.

The [Component] section shown at the beginning of Figure 1 lists the components we intend to
analyze for power. Currently the components we support are caches, branch target buffers, branch
predictors, register files, clock trees, and random logic. Based on the components chosen in the
[Component] section, we define the configuration variables in the sections that follow. For
example, the [AIO] section, which configures the address I/O pads, has the parameters
“frequency” for the bus frequency, “IO_voltage” to describe the supply voltage for the I/O pad,
“Buffer ratio” for buffer sizing, “microstrip length” for modeling the PCB, and finally
“external load” to model the load that is connected to this I/O. “test_arm.cmd” and
“test_alpha.cmd”, which are located in the source code, are template command files the user can
use as a reference.

It  is  important  to  note  that  in  the  cmd  file,  we  assume  capacitance  to  be  in  pF,  time  unit  to  be 
in  ps,  frequency  to  be  in  MHz,  and  voltage  to  be  in  V. 
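The unit convention works out conveniently: pF × V² gives pJ, and pJ × MHz gives µW. The helpers below are a small sketch of these conversions. The function names are hypothetical, and the sketch assumes an energy of C × V² per access and one access per cycle.

```c
/* Energy of one access in pJ: capacitance in pF times the square of the voltage in V. */
static double access_energy_pj(double cap_pf, double vdd_v) {
    return cap_pf * vdd_v * vdd_v;
}

/* Average power in uW when one access occurs every cycle:
 * pJ * MHz = 1e-12 J * 1e6 1/s = 1e-6 W. */
static double average_power_uw(double energy_pj, double freq_mhz) {
    return energy_pj * freq_mhz;
}

/* Clock period in ps for a frequency given in MHz: 1 / (f * 1e6 Hz) = 1e6 / f ps. */
static double clock_period_ps(double freq_mhz) {
    return 1.0e6 / freq_mhz;
}
```

With the [Global] values of Figure 1 (1.8 V, 200 MHz), a block with 1 pF of effective capacitance dissipates 3.24 pJ per access, or 648 µW if accessed every cycle, and the clock period is 5000 ps.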

The power configurations are then integrated with the architectural configurations to create
a single configuration file. We provide architectural templates for a 4-wide issue Alpha
microprocessor and the SA1100 StrongARM. Power configuration templates are also provided for
these two microprocessors in the “./cmd files/” directory. The typical method for executing
Sim-Panalyzer is to run the gen_cfg_<target machine>.pl script and then use the generated
output file as the configuration file for Sim-Panalyzer.

> gen_cfg_<target machine>.pl <architectural config filename> <PA cmd filename>
> sim-panalyzer -config <configuration filename> <executing program> <program parameters>
3. MOSFET Capacitance Component

3.1  Model 

For accurate dynamic power estimation of a circuit, it is important to understand the intrinsic
capacitance components of a transistor, because the dynamic power dissipation is estimated from
those capacitance values and the activity ratio of the transition nodes. Figure 2 shows the
intrinsic capacitance components in the transistor (or MOSFET). In Figure 2-(a), the G, B, S,
and D nodes represent gate, body, source, and drain. C_j and C_jsw in Figure 2-(b) represent the
junction bottom area and sidewall capacitance. L_eff and W_eff, and L_sd and W_sd, represent the
channel length and width, and the junction length and width of the transistor. In deep
sub-micron technology, most of the dynamic power dissipation is due to the charging and
discharging of the gate and source/drain capacitance during each transition. Therefore, we need
an accurate yet simple model to estimate these capacitance components for our power estimation
technique. The gate capacitance can be computed as follows [7]:

C_{gate} = C_{poly} \times (L_{eff} - 2 \times XL) \times W_{eff} + C_{ovlp} \times W_{eff}    (1)

Figure 2: The intrinsic capacitance components in a transistor. (a) lateral view (b) top view

where C_poly, C_ovlp, and XL are the gate poly capacitance per unit area, the gate overlap (with
source and drain) capacitance per unit length, and the gate overlap length, respectively; see the
SPICE parameters specified in [8]. Moreover, L_eff is usually fixed to be the minimum channel
length of the technology for digital circuits, thus the only unknown variable in the expression
is the channel width W_eff. For the computation of source and drain capacitances we use:

C_{D} = AD \times C_{j} + PD \times C_{jsw}    (2)

where AD = L_sd × W_sd is the drain area and PD = 2 × (L_sd + W_sd) is the drain perimeter. AD
and PD can usually be extracted from the physical layout. Alternatively, it is possible to obtain
a rough estimate based on the design rule set of the target technology and the design structure:
L_sd and W_sd can be approximated as 3 × L_eff and W_eff for small devices.
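As a worked illustration of the source/drain capacitance expression (2), the sketch below evaluates it for an NMOS transistor using the sample 0.18µm values listed in Table 1. The function names `drain_cap_af` and `drain_cap_estimate_af` are hypothetical and are not the routines provided in “technology.c”. The layout-free variant applies the rough estimate stated above, a junction length of three channel lengths and a junction width equal to the channel width.

```c
/* Sample NMOS values for a 0.18 um process, taken from Table 1. */
#define CJ_NDIFF   970.0   /* junction capacitance per unit area (aF/um^2) */
#define CJSW_NDIFF 261.0   /* junction side-wall capacitance per unit length (aF/um) */
#define LCH        0.18    /* minimum channel length (um) */

/* Drain capacitance in aF from equation (2): AD * Cj + PD * Cjsw. */
static double drain_cap_af(double lsd_um, double wsd_um) {
    double ad = lsd_um * wsd_um;          /* drain area (um^2) */
    double pd = 2.0 * (lsd_um + wsd_um);  /* drain perimeter (um) */
    return ad * CJ_NDIFF + pd * CJSW_NDIFF;
}

/* Layout-free estimate: junction length ~ 3 x channel length,
 * junction width ~ channel width. */
static double drain_cap_estimate_af(double weff_um) {
    return drain_cap_af(3.0 * LCH, weff_um);
}
```

For a 1 µm wide transistor this gives AD = 0.54 µm² and PD = 3.08 µm, so the estimate is 0.54 × 970 + 3.08 × 261 = 1327.68 aF, or about 1.33 fF.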

3.2  Implementation 

In Sim-Panalyzer, technology-specific parameters are specified in “technology.h”. Table 1
summarizes the TSMC 0.18µm technology parameters used in “technology.h”. All of these
parameters were obtained from MOSIS parametric test results for TSMC 0.18µm CMOS runs [8]. If
users need to run experiments for different technologies, they can update “technology.h”
accordingly. Figure 3 shows part of a MOSIS parametric test for a TSMC 0.18µm CMOS run. In the


Table 1: Technology parameters.

Notation  Physical property                        Definition in “technology.h”  Sample value
LCH       Minimum channel length for the           LCH                           0.18µm
          defined technology
XL        Gate (poly) overlap length               XL                            0.02µm
C_poly    Gate (poly) capacitance per unit         CPOLY_NDIFF (NMOS)            8460aF/µm^2
          area                                     CPOLY_PDIFF (PMOS)            8250aF/µm^2
C_ovlp    Gate (poly) overlap (with source         COVLP_NDIFF (NMOS)            860aF/µm
          and drain) capacitance per unit          COVLP_PDIFF (PMOS)            662aF/µm
          length
C_j       Source/drain junction capacitance        CJ_NDIFF                      970aF/µm^2
          per unit area                            CJ_PDIFF                      [illegible]aF/µm^2
C_jsw     Source/drain junction side-wall          CJSW_NDIFF                    261aF/µm
          capacitance per unit length              CJSW_PDIFF                    225aF/µm
L_sd      Source/drain minimum length              LSD                           0.54µm
Figure 3: MOSIS parametric test results for a TSMC 0.18µm CMOS run.

Figure 4: Basic data structure for transistor capacitance estimation in “technology.h”.

/* channel type */
typedef enum {PCH /* PMOS */, NCH /* NMOS */} channel_t;

/* cmos gate transistor sizes */
typedef struct {
    double WPCH;  /* PMOS transistor channel width */
    double WNCH;  /* NMOS transistor channel width */
} cmos_t;


results, process parameters such as sheet and contact resistance, capacitance parameters, and
SPICE parameters can be seen. In particular, the capacitance parameters are used to build Table 1.
For example, to estimate the poly (gate) capacitance per unit area over the N+ active area (that
is, the poly used to build an NMOS transistor), we look up the value at the intersection of the
“POLY” column and the “Area (N+active)” row. This is how “CPOLY_NDIFF” is obtained for the
0.18µm technology in Table 1; each layer’s capacitance depends on the bottom layer connected to
ground (GND).

To support transistor-level modeling, two basic data structures are provided in Figure 4.
“channel_t” represents the transistor channel type: “PCH” for a p-type and “NCH” for an n-type
channel. The n- and p-type channels are used to build NMOS and PMOS transistors, respectively.
The structure “cmos_t” contains the transistor widths for a pair of complementary PMOS
(“WPCH”) and NMOS (“WNCH”) transistors.
Table 2: Basic transistor capacitance component estimation functions in “technology.c”.

Function Name         Argument                                       Return
                      Data-type   Name     Property                  Data-type
estimate_MOSFET_CG    channel_t   channel  transistor channel type   double
                      double      WCH      transistor width
estimate_MOSFET_CSD   channel_t   channel  transistor channel type   double
                      double      WCH      transistor width

Table 2 shows the corresponding functions in “technology.c”, which calculate the gate
capacitance of (1) and the source/drain capacitance of (2). These functions are the fundamental
routines for building a power model of more complex circuits or functional blocks.
4. Interconnect Capacitance and Resistance
4.1  Models 

There are two ways to estimate the interconnect capacitance and resistance. One way is to
use the sheet resistance for each interconnect layer and the capacitance parameters specified in
Figure 3. However, it is complicated to determine which layer should be used for the interconnect
estimation; this requires significant knowledge of the circuit layouts and other applicable
design features. The other way is to use the Berkeley Predictive Technology Model (BPTM) [9].

The BPTM estimates the interconnect capacitance and resistance for the given interconnect
material and dimensions, and the dielectric material between the layers. The total interconnect
capacitance consists of C_g, the area and fringe capacitance to the underlying plane, and C_c,
the coupling capacitance to the adjacent interconnects. Those capacitance components are
estimated by:


where w, s, t, and h represent the width, space, thickness, and height of the interconnect. These
models are accurate in the ranges 0.16 < w < 2, 0.16 < s < 10, 0.15 < t < 1.2, and
0.16 < h < 2.7; see Figure 5 for the dimensions of the interconnect model.

For global interconnect, which is usually the top interconnect layer, the total interconnect
capacitance C_t is modeled by:

C_{t} = 2 \times C_{c} + C_{g}    (5)

assuming that there are three adjacent interconnects on the left, right, and bottom sides of the
interconnect. Hence, C_t becomes the sum of C_g and twice C_c. However, for local and
intermediate interconnects, there is another layer on top of the interconnect. This doubles C_g
in the total interconnect capacitance equation. Therefore, the total interconnect capacitance
C_t becomes:

C_{t} = 2 \times C_{c} + 2 \times C_{g}    (6)

Figure 5: The dimensions of the interconnect model.

Caveat: both the coupling and ground interconnect capacitances are sensitive to the space between
the adjacent interconnects. Therefore, users should apply the interconnect capacitance model
appropriately depending on the spacing and the existence of adjacent interconnects.

The resistivity depends on the interconnect material. For instance, Cu (copper) and Al
(aluminum) have resistivities of 2.2 and 3.3 µΩ·cm, respectively. The interconnect resistance
for the given resistivity and dimensions is estimated by:

R = \frac{\rho \times l}{w \times t}    (7)

where ρ is the resistivity of the interconnect material and l, w, and t are the length, width,
and thickness of the interconnect.
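The resistance formula (7) and the two ways of combining the coupling and ground capacitances, (5) and (6), can be sketched as below. The function names are hypothetical and are not the “technology.c” routines. The coupling and ground capacitances are taken as inputs, because the BPTM expressions for them are not reproduced here. The sketch assumes resistivity in µΩ·cm and dimensions in µm.

```c
/* Resistance in ohms from equation (7), R = rho * l / (w * t).
 * Assumes rho in micro-ohm-cm and all dimensions in um;
 * 1 micro-ohm-cm = 0.01 ohm-um. */
static double wire_resistance_ohm(double rho_uohm_cm, double l_um,
                                  double w_um, double t_um) {
    return rho_uohm_cm * 0.01 * l_um / (w_um * t_um);
}

/* Total capacitance of a global (top-layer) wire, equation (5): Ct = 2*Cc + Cg. */
static double wire_cap_global(double cc, double cg) {
    return 2.0 * cc + cg;
}

/* Total capacitance of a local or intermediate wire, equation (6): Ct = 2*Cc + 2*Cg. */
static double wire_cap_inner(double cc, double cg) {
    return 2.0 * cc + 2.0 * cg;
}
```

For example, a 1000 µm copper wire that is 0.8 µm wide and 1.25 µm thick has a cross-section of 1 µm², so its resistance is 2.2 × 0.01 × 1000 = 22 Ω.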
4.2 Implementation

In Sim-Panalyzer, we provide the interconnect capacitance and resistance estimation functions
in “technology.c”; see Table 3 for the provided functions. In Table 3,
“estimate_interconnect_CG” and “estimate_interconnect_CC” estimate the C_g and C_c
capacitances in (3) and (4), respectively. Depending on the layer and the existence of adjacent
interconnects, users should combine those two capacitances appropriately. For the interconnect
resistance estimation, “estimate_interconnect_R”, which is based on (7), is provided.

However, to estimate the interconnect capacitance and resistance properly, users are
required to provide appropriate interconnect dimensions for the estimation model. Table 4 shows
typical parameters for interconnect capacitance and resistance estimation for 0.18, 0.13, and
0.10µm technologies. Space and width in Table 4 represent the minimum spacing between the

Table 3: Interconnect capacitance and resistance estimation functions in “technology.c”.

Function Name              Argument                                    Return
                           Data-type  Name  Property                   Data-type
estimate_interconnect_CG   double     l     length                     double
estimate_interconnect_CC              s     space
                                      w     width
                                      h     height
                                      t     thickness
                                      k     dielectric constant
estimate_interconnect_R    double     l     length                     double
                                      s     space
                                      w     width
                                      rou   resistivity

Table 4: Interconnect capacitance and resistance estimation parameters.

0.18/0.13/0.10µm  width (µm)      space (µm)      thickness (µm)  height (µm)     dielectric K
Local             0.28/0.20/0.15  0.28/0.20/0.15  0.45/0.45/0.30  0.65/0.45/0.30  3.5/3.2/2.8
Intermediate      0.35/0.28/0.20  0.35/0.28/0.20  0.65/0.45/0.45  0.65/0.45/0.30  3.5/3.2/2.8
Global            0.80/0.60/0.50  0.80/0.60/0.50  1.25/1.20/1.20  0.65/0.45/0.30  3.5/3.2/2.8
interconnects and the minimum interconnect width. These values should be adjusted based on the
dimensions of the specific circuit layout. The intermediate parameters are used for interconnects
in bit-lines and word-lines in the memory structure, and the global parameters are used for
interconnects in system clocks and power distribution networks.
5. General Circuits
5.1  Models 

For digital circuits, once node capacitances are estimated, the next step is to gather
node-switching information. We compute each switch on the fly during microarchitectural
simulation, because for some circuit blocks the total dynamic power dissipation depends heavily
on the number of switches at the internal nodes [10][11]. To compute the number of switches at
each node, it is necessary to perform logic simulation. Traditionally, event-driven logic
simulation is much slower than compiled-code levelized logic simulation. Event-driven logic
simulation is indispensable for accurate timing-level simulation. However, levelized logic
simulation is enough to compute an approximate number of switches at each node.

To support logic simulation in a microarchitectural simulator and to enhance the simulation
speed, we provide a set of generic data structures and functions that enable users to combine
these basic blocks to build a more complex functional block. In this modeling technique, users
connect each individual transistor and give the transistor sizes in the modeled block. This
tedious procedure is unavoidable because the power dissipation of functional blocks can differ
by more than 100% depending on the circuit style and transistor sizes. During the initialization
process, the specialized logic simulator for a specific functional block estimates the switching
capacitance for every internal node. This logic simulator is embedded in the microarchitectural
simulator to capture the necessary inputs and access activities, and to generate the accumulated
power dissipation statistics accordingly.

First, we explain the generic data structures to model a circuit node and logic gate; see
Figure 6. “node_t” contains a logic value, node switching capacitance, and energy dissipation of
the node. This is one of the basic building blocks for modeling a generic gate. “lgate1_t” and
Figure 6: A generic node data structure

/* netlist node type */
typedef struct {
    bit_t lvalue;        /* logic value */
    double capacitance;  /* node capacitance */
    double energy;       /* transition energy */
} node_t;

/* 1-input generic logic gate type */
typedef struct _lgate1_t lgate1_t;
struct _lgate1_t {
    node_t *Y;      /* current output */
    node_t *A;      /* connected input node ptrs */
    double energy;  /* energy dissipation of the gate */
    double (*lgate_op)(lgate1_t *lgate, double voltage);
                    /* logic and energy evaluation fn of the gate */
};

/* 2-input generic logic gate type */
typedef struct _lgate2_t lgate2_t;
struct _lgate2_t {
    node_t *Y;      /* current output */
    node_t *A, *B;  /* connected input node ptrs */
    double energy;  /* energy dissipation of the gate */
    double (*lgate_op)(lgate2_t *lgate, double voltage);
                    /* logic and energy evaluation fn of the gate */
};


“lgate2_t” represent generic data structures for 1- and 2-input gates. This generic logic gate
type can be easily extended to model gates with three or more inputs by adding input nodes to the
data structure.

In Figure 7, we show a modeling example of the two-input CMOS NAND gate (the most basic logic
component along with the inverter), consisting of four transistors, using our proposed
methodology. First, we declare the input/output nodes A, B, and Y and a generic gate NAND2.
Second, we create a 2-input logic gate using “create_lgate2”, connecting the necessary inputs,
assigning transistor widths, and assigning the logic and energy evaluation function NAND2_op to
lgate_op in the gate data structure. By calling “lgate_op”, the logic value and the energy are
evaluated.
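To show what such an evaluation function might look like, the following is a self-contained sketch of a NAND2 evaluation routine built on structures that mirror Figure 6. It is an illustration under stated assumptions, not the NAND2_op shipped in the library: `bit_t` is assumed to be an integer holding 0 or 1, capacitance is assumed to be in pF and energy in pJ, and each output transition is assumed to cost 0.5 × C × V².

```c
typedef int bit_t;   /* assumption: logic values are 0 or 1 */

typedef struct {
    bit_t lvalue;        /* logic value */
    double capacitance;  /* node switching capacitance (pF) */
    double energy;       /* accumulated transition energy (pJ) */
} node_t;

typedef struct _lgate2_t lgate2_t;
struct _lgate2_t {
    node_t *Y;       /* output node */
    node_t *A, *B;   /* input nodes */
    double energy;   /* accumulated energy of the gate (pJ) */
    double (*lgate_op)(lgate2_t *lgate, double voltage);
};

/* Evaluate a NAND2 gate. If the output toggles, charge the gate with
 * 0.5 * C * V^2 (an assumed per-transition energy) and return that amount;
 * otherwise return 0. */
static double NAND2_op(lgate2_t *g, double voltage) {
    bit_t next = !(g->A->lvalue && g->B->lvalue);
    double e = 0.0;
    if (next != g->Y->lvalue) {
        e = 0.5 * g->Y->capacitance * voltage * voltage;
        g->Y->lvalue = next;
        g->Y->energy += e;
        g->energy += e;
    }
    return e;
}
```

Keeping the evaluation routine behind a function pointer lets a netlist hold gates of different types and evaluate them all through the same call.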
/* declare a generic gate for NAND */
lgate2_t *NAND2;

/* declare gate nodes */
node_t A, B, Y;

/* declare variables storing the transistor widths of the gate */
cmos_t AW, BW;

/* assign transistor sizes */
AW.WPCH = WP1; AW.WNCH = WN1; BW.WPCH = WP2; BW.WNCH = WN2;

/* create NAND2 gate */
NAND2 = create_lgate2(&A, &B, &AW, &BW, Static, NAND2_op);

/* estimate node switching capacitance */
NAND2->Y->capacitance
    = estimate_capacitance_CSD(PCH, ...) + ...

/* evaluate logic and energy dissipation */
A.lvalue = 1; B.lvalue = 0;  /* assign inputs */
energy = NAND2->lgate_op(NAND2, 1.8);  /* evaluate logic and energy dissipation */

Figure 7: Power modeling of a 2-input NAND gate

At the netlist level, multiple gates are created and connected together to simulate the entire
logic block. We levelize each gate or netlist primitive and simulate the gates one after the
other, in a sequence compatible with the partial ordering imposed by levelization. This approach
corresponds to the levelized cycle-based simulation technique in logic simulation [12]. As a
small example, the following illustrates how to create a netlist for a combinational circuit and
simulate the internal node activity. The example shows a combinational circuit consisting of
2-input NOR and 2-input NAND gates. We are able to evaluate the correct output logic value by
evaluating the gates in order of increasing distance from the primary inputs. The levelized
approach we use here most often performs better than an event-driven simulation, since we avoid
maintaining an event queue at the expense of simulating every gate in the netlist for each time
interval [12][13]. A downside of the levelized approach is that we lose information on the
arrival times of signals, thus we cannot evaluate power dissipation due to glitches and
temporary transitions. However, a well-designed combinational circuit should not generate many
glitches, in which case our model is fairly accurate.

This proposed modeling methodology can be extended to model a more complex datapath or
memory circuit. In both cases, since they have regular structures, they can be modeled easily in
an iterative manner after modeling one component.
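The levelized evaluation order can be illustrated with a tiny netlist: a 2-input NOR at level 1 feeding a 2-input NAND at level 2. The sketch below is an assumption-laden illustration rather than the library's implementation. The names `gate_t`, `netlist`, and `simulate_cycle` are hypothetical, nodes are plain integers, and only output transitions are counted (glitches are ignored, as in the levelized approach described above).

```c
#define NGATES 2

typedef enum { GATE_NOR2, GATE_NAND2 } gate_kind_t;

/* One netlist primitive; in0, in1, and out index into the node value array. */
typedef struct {
    gate_kind_t kind;
    int in0, in1, out;
    unsigned long switches;  /* output transitions seen so far */
} gate_t;

/* Nodes 0..2 are primary inputs a, b, c; node 3 is the NOR output and
 * node 4 is the NAND output. Gates are stored in level order, so the
 * inputs of every gate are final before the gate itself is evaluated. */
static gate_t netlist[NGATES] = {
    { GATE_NOR2,  0, 1, 3, 0 },   /* level 1: n3 = NOR(a, b)   */
    { GATE_NAND2, 3, 2, 4, 0 },   /* level 2: n4 = NAND(n3, c) */
};

/* Evaluate every gate exactly once, in level order, for one input vector. */
static void simulate_cycle(int node[5], int a, int b, int c) {
    int i;
    node[0] = a; node[1] = b; node[2] = c;
    for (i = 0; i < NGATES; i++) {
        gate_t *g = &netlist[i];
        int x = node[g->in0], y = node[g->in1];
        int next = (g->kind == GATE_NOR2) ? !(x || y) : !(x && y);
        if (next != node[g->out]) g->switches++;   /* count the transition */
        node[g->out] = next;
    }
}
```

Multiplying each gate's switch count by its node capacitance and the square of the supply voltage would then give a per-gate energy estimate.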

5.2  Implementation 

We provide a subset of data structures and generic functions for 1-, 2-, and 3-input gate
types in “./pmodel/logic.h” and “./pmodel/logic.c”. The detailed usage is described in the
source code. Users can easily extend the implemented data structures and generic functions by
following the instructions given in “logic.h” and “logic.c”. Table 5 lists the implemented
generic gate modeling functions.


Table 5: Generic gate modeling functions in “logic.c”.

Function Name      Argument                                                        Return
                   Data-type      Name          Property                           Data-type
create_lgate“n”    node_t *       A, B, C, ...  input A, B, C, ...                 lgate1_t *
(n=1, 2, ...)      cmos_t *       a, b, c, ...  transistor widths connected
                                                to A, B, C, ...
                   lgstyle_t      lgstyle       logic style
                   double (*)     lgate_op      function pointer evaluating
                                                logic and energy dissipation
                                                of the gate
lgate_op           lgate“n”_t     lgate         gate to be evaluated               double
                   (n=1, 2, ...)
                   double         voltage       supply voltage of the gate
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6.  Memory  Power  Model 

In  modern  microprocessors,  static  random  access  memory  (SRAM)  is  extensively  used  for 
caches,  TLBs,  BTBs,  branch  predictors,  register  files,  instruction  queues,  etc.  For  instance,  40% 
of  the  total  power  in  the  Alpha  21264  and  60%  of  the  total  power  of  the  StrongARM  processor  is 
devoted  to  cache  and  memory  structures  [14]  [15].  As  feature  sizes  shrink  and  supply  voltages 
decrease along with the word-line pulse technique [16], bit-line voltage swings during read
operations decrease to 100mV. This has dramatically reduced the power consumption from the bit-line.

As a first-order approximation, users can obtain an effective switching capacitance for
running Sim-Panalyzer from the power model (energy dissipation per access) provided by the
modified CACTI in our tool set. However, the following modeling technique can be applied in
Sim-Panalyzer to estimate memory power dissipation more accurately.

6.1  Models 

While  the  bit-line  power  dissipation  is  independent  from  the  switching  activity  of  the  data 
due  to  the  complementary  structure  of  bit-lines,  the  power  dissipation  of  the  decoder  is  heavily 
dependent  on  the  switching  events  of  the  decoder  address  inputs.  Hence,  we  need  to  build  a 
switching  event-sensitive  power  model  for  the  decoder.  We  present  an  example  on  how  to  use  the 
technique just presented in Section 5 to model a 7x128 decoder designed with the TSMC 0.18µm
technology Artisan standard cell library and Synopsys® Design Compiler®. Figure 8 shows the
7x128 decoder logic. The decoder logic has a regular structure consisting of a set of NANDs,
NORs, and INVs.

Figure 8: 7x128 decoder logic.

To measure bit-line energy consumption, we used the following equation:

$$E = C_{bit\text{-}line} \times V_{DD} \times V_{swing} \qquad (8)$$

where $C_{bit\text{-}line}$ is the bit-line capacitance per memory column and $V_{swing}$ is the bit-line voltage
swing. $C_{bit\text{-}line}$ includes the bit-line interconnect, the access transistor drain capacitance, and the
pre-charge circuit drain capacitance. The bit-line interconnect capacitance was estimated based on
the actual SRAM dimensions and using available MOSIS parametric test results from the TSMC
0.18µm technology fabrication run [8]. The access transistor drain capacitance connected to the
bit-line was estimated using (2).
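
Equation (8) and the capacitance breakdown above translate directly into code. The sketch below is a minimal illustration: the structure `bitline_t`, its field names, and the choice to scale the access transistor drain capacitance by the number of rows sharing the bit-line are our assumptions, and any values passed in are placeholders rather than the extracted TSMC figures.

```c
/* Hypothetical per-column bit-line description for this sketch. */
typedef struct {
    double c_interconnect;     /* bit-line wire capacitance (F) */
    double c_access_drain;     /* drain capacitance of one access transistor (F) */
    double c_precharge_drain;  /* drain capacitance of the pre-charge circuit (F) */
    int rows;                  /* number of cells attached to the bit-line */
} bitline_t;

/* C_bit-line: wire + all access transistor drains + pre-charge drain. */
double bitline_capacitance(const bitline_t *b)
{
    return b->c_interconnect
         + b->rows * b->c_access_drain
         + b->c_precharge_drain;
}

/* Equation (8): E = C_bit-line * V_DD * V_swing, for one memory column. */
double bitline_energy(const bitline_t *b, double vdd, double v_swing)
{
    return bitline_capacitance(b) * vdd * v_swing;
}
```

Because the swing enters linearly, reducing it from the full supply to about 100mV cuts the per-column energy by the same ratio, which is the effect noted at the start of this section.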

6.2  Implementation 

Figure 9 and Figure 10 show the corresponding descriptions of the functions for creating the
3-to-8 decoder module, “create_module_dec3x8”, and for evaluating its logic and energy,
“dec3x8_op”. The cycle-based logic simulator for the decoder was derived by instantiating and
connecting those gates in an iterative way, as in Figure 9. The switching capacitance of each node
was automatically estimated depending on the circuit topology. Then, the generated logic
simulator annotated with the extracted capacitance is embedded in the microarchitectural simulator with

Figure 9: An example of modeling a 3x8 decoder.

/** dec3x8 consists of a set of inverters and NAND gates. These routines describe
    modeling examples for the decoder. **/

module_dec3x8_t * /* return a 3x8 decoder module instance pointer */
create_module_dec3x8(
    node_t *A[], /* input node pointers */
    node_t *Y[], /* output node pointers */
    double (*module_op)(module_dec3x8_t *module, double voltage) /* module logic and
                                                                    energy evaluation fn */)
{
    module_dec3x8_t *module; /* top module data structure */

    /* create module */
    module = (module_dec3x8_t *)malloc(sizeof(module_dec3x8_t));

    /* level-0 : create netlists for inverted address */
    for (i = 0; i < 3; i++) {
        INVA[i] = create_lgate1(A[i], &x1[0], Static, INV_op);
        A_[i] = INVA[i]->Y;
    }

    /* estimate drain capacitance of the connected INV gates */
    for (i = 0; i < 3; i++)
        INVA[i]->Y->capacitance += (estimate_MOSFET_CSD(PCH, ...) + ...);

    /* level-1 : create netlists for NAND gates */
    NAND3[0] = create_lgate3(A_[0], A_[1], A_[2], ..., NAND3_op);
    NAND3[1] = create_lgate3(A[0],  A_[1], A_[2], ..., NAND3_op);
    NAND3[2] = create_lgate3(A_[0], A[1],  A_[2], ..., NAND3_op);
    NAND3[3] = create_lgate3(A[0],  A[1],  A_[2], ..., NAND3_op);
    NAND3[4] = create_lgate3(A_[0], A_[1], A[2],  ..., NAND3_op);
    NAND3[5] = create_lgate3(A[0],  A_[1], A[2],  ..., NAND3_op);
    NAND3[6] = create_lgate3(A_[0], A[1],  A[2],  ..., NAND3_op);
    NAND3[7] = create_lgate3(A[0],  A[1],  A[2],  ..., NAND3_op);

    /* estimate drain capacitance of the connected NAND gates */
    for (i = 0; i < 8; i++)
        NAND3[i]->Y->capacitance += (3. * (estimate_MOSFET_CSD(PCH, ...) + ...);

    /* level-2 : connect outputs of the netlist */
    for (i = 0; i < 8; i++)
        Y[i] = NAND3[i]->Y;

    return module;
}

Figure 10: An example of evaluating a 3x8 decoder.


/* dec3x8 logic and energy dissipation evaluation function */
double /* return energy */
dec3x8_op(
    module_dec3x8_t *module /* module to be evaluated */,
    double voltage /* supply voltage */)
{
    /* temporary pointers */
    lgate3_t **NAND3;
    lgate1_t **INVA;
    node_t **Y;

    double energy;
    int i;

    /* retrieve node and gate instance pointers */
    Y = module->Y;
    NAND3 = module->NAND3;
    INVA = module->INVA;

    /* logic and energy dissipation evaluation */
    energy = 0.;

    /* level-0: inverter gate logic and energy dissipation evaluation */
    for (i = 0; i < 3; i++)
        energy += INVA[i]->lgate_op(INVA[i], voltage);

    /* level-1: NAND gate logic and energy dissipation evaluation */
    for (i = 0; i < 8; i++)
        energy += NAND3[i]->lgate_op(NAND3[i], voltage);

    /* return energy */
    return energy;
}

an  interface  routine  passing  the  current  address  bus  value  to  the  logic  simulator  and  returning  the 
estimated  energy  consumption  to  the  microarchitectural  simulator. 

In Figure 9, “create_module_dec3x8” creates a module instance for a 3-to-8 decoder
by instantiating and connecting the basic gates that implement the 3-to-8
decoder function. First, memory for the module data structure is allocated with
“module = (module_dec3x8_t *)malloc(sizeof(module_dec3x8_t))”. Second, the decoder
components are created and instantiated in levelized order, from level 0 to level 2. In level 0, inverter
gates are created and instantiated to generate the inverted address bus signals. In level 1, NAND gates
are created and instantiated to form the 3-to-8 decoder function. In level 2, the outputs of the
NAND gates are connected to the output node “Y”.

After the creation of gate instances, we estimate the source/drain capacitance of each gate;
the estimation of the input gate capacitance is done automatically by the logic gate creation
functions. The source/drain capacitance must be estimated separately because the output
drain capacitance differs for each gate type depending on the circuit topology,
while the gate input capacitance is independent of the gate type.

In Figure 10, the module function “dec3x8_op” evaluates the 3-to-8 decoder
by evaluating each gate instance in levelized order. First, the inverter gates are evaluated with
the applied address bus. Second, the NAND gates are evaluated with the updated node logic
values from level 0. The source files ‘./pmodel/dec3x8.h’ and ‘./pmodel/dec3x8.c’
explain in detail how it works. See Table 6 for information on the implemented
function prototypes.
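
To make the two evaluation levels concrete without the rest of the library, the following self-contained sketch models the same 3-to-8 decoder with plain arrays. It is a simplified stand-in, not the code in ‘dec3x8.c’: the names `dec3x8_sketch_t`, `dec3x8_sketch_init`, and `dec3x8_sketch_op` are hypothetical, each output node is given a single assumed capacitance, and one transition is assumed to cost 0.5 × C × V².

```c
#define DEC_NADDR 3
#define DEC_NOUT  8

typedef struct {
    int a_n[DEC_NADDR];  /* level 0: inverter outputs (inverted address) */
    int y[DEC_NOUT];     /* level 1: NAND3 outputs, active low */
    double c_inv;        /* assumed capacitance per inverter output (F) */
    double c_nand;       /* assumed capacitance per NAND3 output (F) */
} dec3x8_sketch_t;

/* Start from the settled state for address 0. */
void dec3x8_sketch_init(dec3x8_sketch_t *d, double c_inv, double c_nand)
{
    for (int i = 0; i < DEC_NADDR; i++)
        d->a_n[i] = 1;
    for (int k = 0; k < DEC_NOUT; k++)
        d->y[k] = (k != 0);
    d->c_inv = c_inv;
    d->c_nand = c_nand;
}

/* Apply a new address; return the energy of all node transitions. */
double dec3x8_sketch_op(dec3x8_sketch_t *d, unsigned addr, double voltage)
{
    double per_farad = 0.5 * voltage * voltage;  /* assumed cost per toggle */
    double energy = 0.0;

    /* level 0: inverters generate the complemented address bits */
    for (int i = 0; i < DEC_NADDR; i++) {
        int v = !((addr >> i) & 1u);
        if (v != d->a_n[i])
            energy += per_farad * d->c_inv;
        d->a_n[i] = v;
    }

    /* level 1: NAND3 k takes A[i] where bit i of k is 1, else A_[i] */
    for (int k = 0; k < DEC_NOUT; k++) {
        int all = 1;
        for (int i = 0; i < DEC_NADDR; i++) {
            int a = (int)((addr >> i) & 1u);
            all &= ((k >> i) & 1) ? a : d->a_n[i];
        }
        int v = !all;
        if (v != d->y[k])
            energy += per_farad * d->c_nand;
        d->y[k] = v;
    }
    return energy;
}
```

Applying the same address twice in a row returns zero energy, while a change of address toggles the affected inverters plus exactly two NAND outputs (the old and the new selected line). This is the switching-event sensitivity that a single per-access energy value cannot capture.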

Table 6: 3-to-8 decoder modeling functions in “dec3x8.c”.

Function Name         Argument                                                                  Return
                      Data-type          Name       Property                                    Data-type
--------------------  -----------------  ---------  ------------------------------------------  -----------------
create_module_dec3x8  node_t **          A, Y       input A and output Y                        module_dec3x8_t *
                      cmos_t **          x3WCH      transistor widths for 3-input NAND gates
                      cmos_t **          x1WCH      transistor widths for inverter gates
                      double (*)         module_op  function pointer evaluating logic and
                                                    energy dissipation of the module
dec3x8_op             module_dec3x8_t *  module     module to be evaluated                      double
                      double             voltage    supply voltage of the module

Figure 11 and Figure 12 show the corresponding descriptions of the functions for creating the
7-to-128 decoder module, “create_module_dec7x128”, and for evaluating its logic and energy,
“dec7x128_op”. “create_module_dec7x128” instantiates 3 “dec3x8” modules, NOR
gates, and inverters to implement the “dec7x128” module. In addition, “dec7x128_op”
reuses “dec3x8_op” to evaluate the logic and energy dissipation. The source files
‘./pmodel/dec7x128.h’ and ‘./pmodel/dec7x128.c’ explain in detail how
it works. See Table 6 for information on the implemented function prototypes.

The hierarchical structure, together with module reuse, gives users considerable flexibility
and greatly reduces the modeling effort. We can create a more complex function module by
instantiating simple modules and connecting their nodes. The above examples are for a specific
circuit style of memory decoder, but if users want to evaluate different circuit styles with the
same decoder function, they only have to recombine the gates and reconnect the internal nodes.

6.3  Calibration 

Figure  13-(a)  shows  the  calibrated  energy  consumption  of  a  4KB  SRAM  power  model 
against  HSPICE  measurement.  In  the  figure,  each  point  represents  the  energy  consumption  for 
each applied vector. For the HSPICE experiment, we modeled and simulated the whole 7x128

Figure 11: An example of modeling a 7x128 decoder.

/** dec7x128 consists of 3x8 decoders, NOR gates, and inverters. These routines
    describe modeling examples for the decoder. **/

module_dec7x128_t * /* return a 7x128 decoder module instance pointer */
create_module_dec7x128(
    node_t *A[], /* input node pointers */
    node_t *Y[], /* output node pointers */
    ...,
    double (*module_op)(module_dec7x128_t *module, double voltage) /* module logic and
                                                                      energy evaluation fn */)
{
    module_dec7x128_t *module; /* top module data structure (1) */

    /* allocate space for the module instance (2) */
    module = (module_dec7x128_t *)malloc(sizeof(module_dec7x128_t));

    /* level-0 : create netlists for 3x8 decoders */
    /* create module instances and connect the nodes */
    dec3x8[0] = create_module_dec3x8(A, dec3x8Y0, &x3WCH[0], &x1WCH[0], dec3x8_op);
    dec3x8[1] = create_module_dec3x8(A+3, dec3x8Y1, &x3WCH[0], &x1WCH[0], dec3x8_op);
    dec3x8[2] = create_module_dec3x8(A+6, dec3x8Y2, &x3WCH[0], &x1WCH[0], dec3x8_op);

    /* level-1 : create netlists for NOR gates */
    /* allocate space for the NOR instances */
    for (i = 0; i < 8; i++) {
        for (j = 0; j < 8; j++) {
            /* create gate instance and estimate the output node drain capacitance */
            NOR3[8*i+j] = create_lgate3(dec3x8Y0[j], dec3x8Y1[i], dec3x8Y2[0], ...);
            NOR3[8*i+j]->Y->capacitance += (1.*(estimate_MOSFET_CSD(PCH, ...) + ...);
        }
    }

    for (i = 0; i < 8; i++) {
        for (j = 0; j < 8; j++) {
            /* create gate instance and estimate the output node drain capacitance */
            NOR3[64+8*i+j] = create_lgate3(dec3x8Y0[j], dec3x8Y1[i], dec3x8Y2[1], ...);
            NOR3[64+8*i+j]->Y->capacitance += (1.*(estimate_MOSFET_CSD(PCH, ...) + ...);
        }
    }

    /* level-2 : create netlists for inverters after NOR gates */
    INVNOR = (lgate1_t **)calloc(128, sizeof(lgate1_t *));
    for (i = 0; i < 128; i++) {
        INVNOR[i] = create_lgate1(NOR3[i]->Y, &x1WCH[2], Static, INV_op);
        INVNOR[i]->Y->capacitance += (1.*(estimate_MOSFET_CSD(PCH, ...) + ...);
    }

    /* level-3 : create netlists for inverters after INVNOR gates */
    for (i = 0; i < 128; i++) {
        /* create gate instance and estimate the output node drain capacitance */
        INVWL[i] = create_lgate1(INVNOR[i]->Y, &x1WCH[3], Static, INV_op);
        INVWL[i]->Y->capacitance += (1.*(estimate_MOSFET_CSD(PCH, ...) + ...);
    }

    /* level-4 : connect outputs of the netlist */
    for (i = 0; i < 128; i++)
        Y[i] = INVWL[i]->Y;

    return module;
}

Figure 12: An example of evaluating a 7x128 decoder.

/* dec7x128 logic and energy dissipation evaluation function */
double
dec7x128_op(
    module_dec7x128_t *module,
    double voltage)
{
    module_dec3x8_t **dec3x8;
    lgate3_t **NOR3;
    lgate1_t **INVNOR;
    lgate1_t **INVWL;
    node_t **Y;

    double energy;
    int i, j;

    /* retrieve node and gate instance pointers */
    Y = module->Y;
    dec3x8 = module->dec3x8;
    NOR3 = module->NOR3;
    INVNOR = module->INVNOR;
    INVWL = module->INVWL;

    /* logic and energy dissipation evaluation */
    energy = 0.;

    /* level-0: 3x8 decoder logic and energy dissipation evaluation */
    for (i = 0; i < 3; i++)
        energy += dec3x8[i]->module_op(dec3x8[i], voltage);

    /* level-1: NOR logic and energy dissipation evaluation */
    for (i = 0; i < 128; i++)
        energy += NOR3[i]->lgate_op(NOR3[i], voltage);

    /* level-2: INVNOR logic and energy dissipation evaluation */
    for (i = 0; i < 128; i++)
        energy += INVNOR[i]->lgate_op(INVNOR[i], voltage);

    /* level-3: INVWL logic and energy dissipation evaluation */
    for (i = 0; i < 128; i++)
        energy += INVWL[i]->lgate_op(INVWL[i], voltage);

    /* return energy */
    return energy;
}


decoder and a dummy 128x256-bit memory array; to speed up the simulation, we modeled just one
column of 128 cells and multiplied the result by 256. As seen in Figure 13-(a), the estimated energy
consumption follows the actual measurement closely for each applied vector. The proposed
technique has an average estimation error of 7% for 1K vectors compared to the HSPICE
measurement. In terms of execution time, however, the proposed technique completed within a
few seconds, while HSPICE took 3.4 hours on UltraSparc80® 450MHz dual processors with a
4MB L2 cache.

Figure 13: Calibration of SRAM energy consumption model in (a) and L1 instruction and data
cache energy consumption in (b).

Figure 13-(b) shows the total accumulated energy consumption of the 4KB L1 instruction
and data caches obtained by running 10 million instructions for a subset of embedded benchmark
programs from the MiBench Benchmark Suite [17]. The proposed power models were embedded
in Sim-Panalyzer with the StrongARM configuration for this experiment. While estimating
energy, the actual address stream was applied to the SRAM power model on the fly.
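
Since the decoder model of Section 6.1 reacts to switching events on its address inputs, the address stream itself determines how much decoder activity is simulated. The sketch below counts address-bit transitions over a stream as a rough proxy for that activity. It is an illustrative helper under our own assumptions (the function names and the idea of masking to the decoder's index bits are ours), not part of Sim-Panalyzer.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of set bits in x. */
unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Total number of bit transitions between consecutive addresses,
 * restricted to the bits selected by `mask` (for example, the 7 index
 * bits that drive a 7x128 decoder).
 */
unsigned long address_bit_switches(const uint32_t *addr, size_t n,
                                   uint32_t mask)
{
    unsigned long total = 0;
    for (size_t i = 1; i < n; i++)
        total += popcount32((addr[i] ^ addr[i - 1]) & mask);
    return total;
}
```

For instance, the sequential stream 0, 1, 2, 3 produces 4 transitions on a 7-bit index, whereas the scattered stream 0, 0x7F, 0, 0x55 produces 18 over the same number of accesses.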

The estimated energy consumption results show that total energy consumption can differ
significantly between benchmark programs even if the same number of instructions
is executed. Usually, the instruction cache consumes more energy than the data cache.
However, the average energy dissipation per access of a data cache is usually higher than that of
an instruction cache. The primary reason for this behavior is that the activity
ratio of an instruction cache is higher than that of the data cache, while the address stream supplied to the
data cache is less sequential than that of the instruction cache, which means more switching events
on the address bus. These characteristics imply that both the input switching and the access activity
of an application-specific functional block must be considered for accurate power estimation of
embedded microprocessors. In terms of overhead for the microarchitectural simulator, the
proposed technique increases the execution time by 3% for both the instruction and data cache.

7. Datapath and Execution Unit

7.1  Implementation 

We now present an example of how to use the proposed technique to implement a datapath
component and generate power estimates that interface at run-time with the microarchitectural
simulator. For this example, we consider a 32-bit carry-select adder consisting of eight 8-bit
ripple-carry adders as reported in Figure 14. For each 8-bit add, two 8-bit ripple-carry adders are
used to compute the results in parallel for zero and one carry-ins, respectively. The first step is to
construct the basic block for a full adder by instantiating the necessary logic gates: the
constructor of the class FullAdder creates all the internal gates and connects them so that the two
output nodes, S and CO, produce the correct functionality. At this point, the main program can
create and connect the full adder blocks using the loop shown in Figure 14. Note how the
program structure lends itself naturally to the parameterization of the bus width. By instantiating
eight 8-bit ripple-carry adders, we are able to build a 32-bit carry-select adder. At this point, the model
includes a complete description of the logic block under observation. The last two steps provide
an interface to the microarchitectural simulator by retrieving on the fly at each cycle the input
vectors corresponding to the two operands of the add operation, and proceeding with the
power/logic simulation. The source files ‘./pmodel/rca.h’ and ‘./pmodel/rca.c’
explain in detail how it works. Those power models are parameterizable; by changing
“WIDTH” in the function, the bit-width of the implemented adder can be easily changed.

Figure 14: An example of modeling an 8-bit RCA.

/* step 1: create netlist */
for (i = 1; i < WIDTH; i++) {
    /* create and connect FullAdder instances; the carry-out of
       stage i-1 feeds the carry-in of stage i */
    FA[i] = FullAdder(A[i], B[i], FA[i-1].CO, CO[i], SO[i]...);
}

/* step 2: load input vectors */
A.Apply(LOp); B.Apply(ROp);

/* step 3: logic and energy evaluation */
for (i = 0; i < WIDTH; i++) {
    energy += FA[i].GateOp(voltage);
}
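
The three steps can also be sketched as a self-contained C routine. This is a deliberately simplified illustration, not the code in ‘rca.c’: it tracks only the S and CO outputs of each full adder instead of every internal gate node, it reports a toggle count instead of energy, and the names `rca_sketch_t` and `rca_sketch_op` are hypothetical.

```c
#ifndef WIDTH
#define WIDTH 8   /* bit-width of the ripple-carry adder */
#endif

/* Previous output values of one full adder stage. */
typedef struct {
    int s;    /* sum output */
    int co;   /* carry output */
} fa_state_t;

/* Zero-initialize this structure before the first evaluation. */
typedef struct {
    fa_state_t fa[WIDTH];
} rca_sketch_t;

/*
 * Apply operands a and b with carry-in cin, evaluate the stages from the
 * least significant bit upward (the levelized order for a ripple-carry
 * chain), store the sum in *sum, and return the number of S/CO nodes that
 * changed value relative to the previous evaluation.
 */
int rca_sketch_op(rca_sketch_t *r, unsigned a, unsigned b, int cin,
                  unsigned *sum)
{
    int toggles = 0;
    int carry = cin;
    unsigned result = 0;

    for (int i = 0; i < WIDTH; i++) {
        int ai = (int)((a >> i) & 1u);
        int bi = (int)((b >> i) & 1u);
        int s = ai ^ bi ^ carry;
        int co = (ai & bi) | (carry & (ai ^ bi));

        toggles += (s != r->fa[i].s) + (co != r->fa[i].co);
        r->fa[i].s = s;
        r->fa[i].co = co;

        result |= (unsigned)s << i;
        carry = co;   /* carry-out of stage i feeds stage i+1 */
    }
    *sum = result;
    return toggles;
}
```

Multiplying the toggle count by a per-node transition energy would turn this into an energy estimate. Compiling with a different `WIDTH` changes the adder size without touching the loop, which mirrors the parameterization described above.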

7.2  Calibration 

We calibrated our model by comparing the results with the corresponding HSPICE circuit
simulation. Figure 15-(a) shows a calibration using the carry-select adder from the previous case
study. Each point in the graph represents the dissipated energy estimated or measured by applying
each vector to the circuit. The figure indicates that the proposed technique tracks the actual
power dissipation of adders very well; we found that the average estimation error for 1K
vectors is around 9%. The steady under-approximation error of the power estimator can be
explained by two sources of power dissipation that our model does not take into account: glitches
occurring because of the relative delays among signal propagation times, and temporary short
circuits due to both PMOS and NMOS transistors being turned on during the transition.

To produce the graph in Figure 15-(b), we simulated the SPEC2000 INT benchmark
programs [18] while running the power simulator on our 32-bit adder component. For each
benchmark program, we applied 32K vectors to the power model. The results show the total energy that
was dissipated in the adder. Note how the total energy dissipation varies widely
across benchmark programs. For instance, mcf consumes 480% more energy than eon,
which indicates that the amount of data activity plays an important role in the accurate
estimation of the power dissipation of a datapath component. Because of its accuracy and
flexibility, this technique could easily be applied in trade-off studies of various solutions for datapath
circuits, or for optimization of power dissipation in embedded processors where the datapath
constitutes a significant portion of the total power dissipation.

Figure 15: Calibration of 32-bit CSA energy consumption model in (a) and total energy consumpti

8. Clock Distribution Tree

Ever-increasing clock frequencies and die areas of microprocessor designs require more
aggressive clock distribution networks (or trees). As a result, the clock's share of total power
dissipation has become more significant, depending on the target clock frequency and the maximum
allowable clock skew. In a small embedded processor design such as the StrongARM,
the clock distribution network consumes only 10% of the total power [15]. However, for the
Alpha 21264 microprocessor, the clock consumes up to 32% (23W) of the total average chip
power (72W) [14], and the percentage of total clock power is expected to keep increasing for
high-end microprocessors that employ aggressive clock frequencies and pipeline depths [25].
Hence, accurate estimation of clock power is key to accurate estimation of total microprocessor
power, and the fraction of clock power is substantial whether we consider embedded or
high-end microprocessors.

8.1  Models 

In  a  tree-style  clock  distribution  system,  the  power  consumption  of  a  clock  distribution  tree 
consists  of  three  components: 

•  Clock  distribution  tree  interconnects. 

•  Clock  buffer  gates  and  parasitics. 

•  Clocked  nodes. 

In  the  Alpha  21264,  the  power  dissipation  of  the  clock  distribution  tree  interconnect  and  buffers  is 
65%  of  the  total  clock  distribution  system  power.  Assuming  that  an  H-tree  style  clock  distribution 
system  is  employed,  the  total  interconnect  capacitance  of  a  clock  distribution  tree  becomes: 

$$C_{H\text{-}tree} = c_{int} \times \sqrt{A_{die}} \times \sum^{N_{tree}} (\ldots) \qquad (9)$$

where $c_{int}$, $A_{die}$, and $N_{tree}$ represent the interconnect capacitance per unit length, chip die area, and
the number of levels of depth of the tree, respectively [7]. Also, $N_{tree}$, the depth of the tree, is
given by:

$$N_{tree} = \left(\ldots\,\frac{r_{int} \times c_{int}}{cskew}\,\ldots\right) + 1, \qquad (10)$$

where $r_{int}$ and $cskew$ represent the interconnect resistance per unit length and maximum allowable
clock skew.

As seen from (9) and (10), several variables impact the capacitance of a clock
distribution interconnect; the estimation of clock distribution interconnect capacitance is more
complicated for other clock distribution styles such as balanced H-trees or tree-driven
grids. Hence, it is extremely difficult to estimate all the needed parameters accurately at the
microarchitectural level. We estimate $c_{int}$ and $r_{int}$ using the interconnect capacitance and resistance
estimation models we provide in (3), (4), and (7) (see Table 3 for the functions provided for the
interconnect parameter estimations in Sim-Panalyzer).

In Wattch, for example, the chip die area and the depth of the clock distribution tree are
fixed. However, both parameters can change significantly as the microarchitectural and circuit
parameters are changed. For instance, the addition of on-chip L2 caches causes a major increase in
die area. In addition, the physical implementation phase, such as placement and routing, can affect
the global chip area. Even worse, estimating parameters such as the maximum allowable clock skew
requires an in-depth understanding of both circuit design and the semiconductor process,
together with the target chip specification.

Figure 16: Power consumption of clock distribution tree for maximum allowable clock skew and
microprocessor die area.
H-tree is assumed for the clock distribution tree topology. The 100% normalized die area corresponds to that of the Alpha 21264
implemented with 0.35µm technology. The clock skew is a relative fraction of the 600MHz clock frequency.

Figure 16 shows the sensitivity of the clock distribution tree power to the maximum allowable
clock skew and chip die area of a microprocessor. In this experiment, an “H-tree” is assumed for
the clock distribution tree topology. The 100% die area in Figure 16 corresponds to that of the
Alpha 21264 implemented with 0.35µm technology. The clock skew in Figure 16 is the relative
fraction of the 600MHz clock frequency of the Alpha 21264 microprocessor. According to the
estimation using (9) and (10), the estimated global clock distribution tree power with 2.9%
maximum allowable clock skew is 4W, which agrees with the published value in [26]. However, as
seen in Figure 16, the clock distribution tree power has an exponential dependency on the
microprocessor die area and maximum allowable clock skew. For instance, the estimated clock distribution
tree power dissipation increases by 200% when the target clock skew is changed from 2.5% to 2%
at the 100% die area point. Likewise, if the microprocessor die area increases from 100% to 125% at the 2.5%
clock skew point, the clock distribution tree power also increases by 200%. Hence, a slight
misprediction of either clock skew or chip area incurs a significant error in power estimation.

The power consumption in Figure 16 is estimated only with the clock distribution tree wire
capacitance. The clock distribution buffers also dissipate a substantial amount of power, which
can be estimated by:

$$C_{sw,clk} = \left(C_{H\text{-}tree} + C_{clk\,load}\right) \times \left(\ldots\,\alpha_{clk\,buffer}\,\ldots + 1\right) \qquad (11)$$

where $\alpha_{clk\,buffer}$ represents the tapering factor or the optimal stage ratio for the clock buffer [7]
and $C_{clk\,load}$ for the Alpha 21264 is around 2.7nF. Equations (9), (10), and (11) give a total power of 26W for the
clock distribution including switching of clocked nodes, which is similar to the 23W reported in [26].
Out of the 26W of total clock distribution power, 14.7W is consumed by the clock buffers, which
is quite significant. However, to estimate the clock buffer power accurately, the exact amount of
clock node capacitance must be known, which requires detailed information on the number and
sizes of flip-flops in the microprocessor. Depending on the individual device sizes and numbers,
the power consumption of a clock distribution system can be quite different.

In summary, it is extremely difficult to estimate the clock power accurately because of the many
uncertainties at the microarchitectural level. In particular, one must have accurate die area and
clock node capacitance estimates for the target microprocessor, which depend strongly on
changes in the microarchitectural and circuit implementation. However, accurate power
estimation of the clock distribution system is very important, since it comprises a large fraction of power
dissipation and can vary over a wide range with small changes to parameters. Therefore, a proper
clock skew, die area, and clocked node capacitance must be specified for accurate estimation of
total clock power dissipation.

8.2  Implementation 

In Sim-Panalyzer, we provide a data structure to specify the clock distribution tree style;
we support two clock tree styles, H-tree and balanced H-tree (see Figure 17 for the data
structure). This data structure is passed to the clock distribution tree switching capacitance
estimation function, “estimate_switching_CT”, which is based on (9), (10), and (11).
“estimate_switching_CT”, which returns the total switching capacitance of the clock tree, is
provided with other related subroutines in “clock.c”. Users should specify the clock tree style,
target clock skew, clocked die area, clocked node capacitance, and the number of optimal buffer
stages for the clock distribution buffer.

Finally,  we  provide  a  command-line  executable  “clock-panalyzer”  with  the  following 
options: 


Figure  17:  Basic  data  structure  for  clock  distribution  tree  style  in  “technology.h”. 

/* clock tree style type */
typedef enum {Htree /* H-tree */, balHtree /* balanced H-tree */} clocktree_style_t;


Table 9: Clock tree switching capacitance estimation functions in “clock.c”.

Function Name          Argument                                                               Return
                       Data-type          Name          Property                              Data-type
---------------------  -----------------  ------------  ------------------------------------  ---------
estimate_switching_CT  clocktree_style_t  style         clock tree style: H-tree or balH-tree  double
                       double             cskew         clock skew
                       double             area          clocked die area
                       double             switching_CN  clocked node capacitance
                       int                nbstage_opt   number of optimal buffer stages

•  -t  clock tree style (e.g., -t Htree or -t balHtree)

•  -s  clock skew in picoseconds (ps) (e.g., -s 20)

•  -a  die area in mm² (e.g., -a 100)

•  -n  number of clock buffer stages (e.g., -n 4)

•  -l  clocked node capacitance in picofarads (pF) (e.g., -l 20)

This is useful for clock power trade-off studies over option parameters such as clock tree style,
clock skew, die area, and clocked node capacitance.




9.  I/O 

Generally, the I/O circuits charge and discharge a large loading capacitance and require a
higher supply voltage than the microprocessor core (e.g., 3.3 V). This makes the I/O circuits a
major contributor to the peak power dissipation of a microprocessor. Although the microprocessor
may not frequently access external memory through the I/O in the presence of on-chip L1 and L2
caches, a significant amount of power is still consumed by the I/O circuits; the Alpha 21264 I/O
circuits consume 5% (~3.5 W) of total power on average [14]. Furthermore, the fraction of power
dissipated by the I/O circuits is significantly larger in embedded microprocessors that have no
on-chip L1 or L2 cache. However, the power consumed by I/O circuits has been ignored or not
modeled properly in most frameworks for microarchitectural power estimation. There are two
sources of error: 1) the lack of detailed information about the external loading capacitance
connected to the I/O circuit, and 2) the I/O bus transaction model used in the microarchitectural
simulator.

9.1  Model 

Figure 18 shows both the memory I/O access modeling in the microarchitectural simulator
and the cycle-accurate I/O bus transaction modeling. For example, SimpleScalar, the baseline
simulator for most microarchitectural power estimation simulators, transfers all the requested
data blocks at the call time of the external memory access function (e.g., mem_access in Figure
18) and returns the access latency. A typical microprocessor, in contrast, transfers the blocks
one by one over several I/O bus cycles with a more complex data transfer protocol. As we have
noted, cycle-based microarchitectural simulators derive their speed from abstracting out many of
the physical details. Hence, the simulator exposes no details of the memory transfer protocol,
including the exact timing and bus switching activity of the address and data I/O buses. To
correct this, the simulator must be augmented to trace the actual I/O address and data streams
during I/O transactions in a cycle-accurate way and feed them to the power model at the pertinent
I/O transaction cycle, as illustrated in Figure 18.

Figure 19-(a) shows an I/O bus power model accounting for the actual I/O bus switching
activity during memory I/O bus cycles. In this model, the number of “0” to “1” transitions on the
I/O pins is counted by comparing the blocks transferred in the previous and current I/O bus
cycles. At the initiation of the I/O bus transaction cycle, the high-impedance bus state is
assumed. To estimate the power dissipation of the I/O bus at a pertinent I/O cycle, the total
number of I/O pin transitions for each block is passed to the I/O circuit power model. In general,
the switching capacitance of the I/O circuit consists of the intrinsic (or internal) capacitance
of the I/O circuit itself and the extrinsic (or external) capacitance of the connected chipset and
of the PCB interconnect between the microprocessor and the chipset I/O pins. The extrinsic
capacitance driven by the I/O circuit is more significant than the intrinsic capacitance of the
I/O circuit itself. Therefore, it is important to estimate the extrinsic capacitance accurately.

Figure 18: The memory access I/O modeling in the microarchitectural simulator and cycle-
accurate I/O transaction modeling.

Figure 19: The I/O bus switching activity model in (a) and a snapshot of power dissipation by a
64-bit processor I/O bus in (b). In (b), it is assumed that the front system bus (or I/O bus)
operating frequency is 800 MHz and the voltage is 2.6 V [21]. A 4-wide issue machine is used for
the experiment.
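
The transition counting described for the I/O bus model can be sketched in a few lines of C. This is a minimal illustration, not the Sim-Panalyzer source: the function names are hypothetical, a single uniform per-pin capacitance is assumed, and how the high-impedance state at the start of a transaction maps to a previous bus word (for example, all zeros) is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Count the pins that go from 0 to 1 between two consecutive bus cycles. */
int count_rising(uint64_t prev, uint64_t cur)
{
    uint64_t rising = ~prev & cur;  /* bit was 0 before and is 1 now */
    int n = 0;

    while (rising) {
        rising &= rising - 1;       /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Energy (joules) drawn from the supply in one bus cycle, assuming every
 * rising pin charges the same per-pin capacitance c_pin (farads) to vdd
 * (volts). The uniform c_pin is an assumption of this sketch.
 */
double io_cycle_energy(uint64_t prev, uint64_t cur, double c_pin, double vdd)
{
    assert(c_pin >= 0.0);
    return count_rising(prev, cur) * c_pin * vdd * vdd;
}
```

Because the result depends on the bit pattern of two consecutive words, the energy can differ widely from one bus cycle to the next, which is the reason for tracing the bus contents cycle by cycle.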

In most high-end computer systems, the microprocessor is not directly connected to the
memory module on the PC motherboard. It is connected to the memory controller through the
front system bus (or simply the I/O bus). Hence, the I/O pin capacitance of the microprocessor
and chipset must be known, as well as the PCB interconnect capacitance of the front system bus.
According to [21], the typical package pin capacitance of both the microprocessor and the chipset
is 5 pF per I/O pin. To estimate the PCB interconnect capacitance, some details of the
interconnect dimensions must be known. The interconnect dimensions and layer information are
usually found in the chipset or microprocessor specification; if the microprocessor or chipset
has not yet been developed, the most recent available information can be used. The estimated PCB
interconnect capacitance is around 2.15 pF per inch for the given specification, in which the
minimum and maximum allowed front system bus interconnect lengths are 3” and 6”, respectively.
Therefore, the PCB interconnect capacitance of the front system bus is between 6.5 pF and 13 pF
depending on the interconnect length; for the 6” front system bus, the PCB interconnect
capacitance is around 13 pF, which results in a total of 23 pF per pin including the package pin
capacitance of both the microprocessor and the chipset.
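
The per-pin capacitance arithmetic above can be written out as a short C helper. The constants are the values quoted in the text; the function and macro names are hypothetical and chosen only for this sketch.

```c
#include <assert.h>

/* Values quoted in the text above. */
#define PKG_PIN_CAP_PF   5.0   /* package pin capacitance at each end [21] */
#define PCB_CAP_PF_INCH  2.15  /* estimated PCB interconnect capacitance   */

/* Hypothetical helper: total per-pin load (pF) for a bus of given length. */
double fsb_pin_cap_pf(double length_inch)
{
    assert(length_inch >= 0.0);
    /* processor pin + chipset pin + PCB trace */
    return 2.0 * PKG_PIN_CAP_PF + PCB_CAP_PF_INCH * length_inch;
}
```

For a 6” bus this evaluates to 22.9 pF, which the text rounds to 23 pF; for a 3” bus it evaluates to 16.45 pF.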

With the I/O bus capacitance and the detailed bus protocol modeling, we were able to
estimate the power dissipation of a 64-bit microprocessor I/O bus with realistic parameters (see
Figure 19-(b) for a snapshot of I/O bus power dissipation when running eon). The experiment
shows that the power dissipation of the I/O bus is substantial whether the front system bus
interconnect length is 3” or 6”, and that it can contribute significantly to both the peak and
the average power dissipation of the microprocessor during I/O bus cycles. Furthermore, this
experiment shows that counting switching activity in a cycle-accurate way is important, because
the power dissipated by the I/O at a specific I/O cycle differs significantly depending on the
number of I/O pins that switch.

9.2  Implementation 

The I/O is divided into four subcomponents: the buffer chain, I/O pad, microstrip, and external
load. Each subcomponent is configurable from the cmd file. The I/O pad information was extracted
from technology libraries; for our example, the TSMC 0.18um Artisan cell libraries were used.
Microstrip capacitance was extracted from an impedance calculator. We used (12) to obtain the PCB
capacitance and made the wire length configurable. The external load is also configurable. We
classify I/O pads into two types: bidirectional and unidirectional. Bidirectional means that the
I/O goes to the high-impedance (‘Z’) state when idle. Unidirectional means that it maintains the
last active value when idle. When an external memory access occurs, we create a queue of the I/O
power estimates for the bus cycles of the memory transaction. Sim-Panalyzer evaluates these
estimates at the appropriate cycle. This enables us to estimate peak I/O power in a
cycle-accurate manner. I/O related code is located in ‘./pmodel/io_panalyzer.c’ and
‘./pmodel/io_panalyzer.h’.

Figure 20: Cross section of microstrip
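
The queueing of per-cycle I/O estimates can be pictured with a small FIFO keyed by bus cycle. The following is a sketch under stated assumptions rather than the contents of io_panalyzer.c: all type and function names are hypothetical, the capacity is arbitrary, and estimates are assumed to be pushed in increasing cycle order.

```c
#include <assert.h>
#include <stddef.h>

#define IOQ_CAP 64  /* arbitrary capacity chosen for this sketch */

/* One pending estimate: energy to account for at a given cycle. */
typedef struct {
    unsigned long cycle;
    double energy;
} io_event_t;

/* Fixed-size FIFO; events are assumed to be pushed in cycle order. */
typedef struct {
    io_event_t ev[IOQ_CAP];
    size_t head, count;
} io_queue_t;

/* Schedule an estimate; returns 0 on success, -1 if the queue is full. */
int ioq_push(io_queue_t *q, unsigned long cycle, double energy)
{
    size_t tail;

    assert(q != NULL);
    if (q->count == IOQ_CAP)
        return -1;
    tail = (q->head + q->count) % IOQ_CAP;
    q->ev[tail].cycle = cycle;
    q->ev[tail].energy = energy;
    q->count++;
    return 0;
}

/* Remove and sum every estimate that is due at or before `now`. */
double ioq_drain(io_queue_t *q, unsigned long now)
{
    double total = 0.0;

    assert(q != NULL);
    while (q->count > 0 && q->ev[q->head].cycle <= now) {
        total += q->ev[q->head].energy;
        q->head = (q->head + 1) % IOQ_CAP;
        q->count--;
    }
    return total;
}
```

In this arrangement the memory access function would push one entry per bus cycle of the transaction, and the simulator main loop would call the drain routine once per cycle so that each estimate is charged to the cycle in which the transfer occurs.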

10.  Conclusion

In this study, we provided power modeling methodologies for deep sub-micron
microprocessors. We introduced a simple switching capacitance extraction methodology and a
cycle-based logic simulation technique that can be easily embedded into a high-level
microarchitectural simulator such as SimpleScalar. The high-level microarchitectural simulator
enables the user to explore a much larger design space quickly. Combining this high-level
simulator with the embedded low-level logic simulator quickly yields more accurate power
estimates for application-specific functional blocks. In addition, we illustrated and calibrated
our power models for caches, execution units, and I/O. Our experiments show that the power models
track HSPICE closely for each applied vector and produce accurate average energy dissipation.
This is achieved with a very small execution time overhead, which makes the approach well suited
to estimating the power usage of many different microprocessor designs.

Appendix  A  -  Sim-iPAQ

The following is the first release of the SimpleScalar iPAQ platform simulator. It is capable
of booting the Linux operating system and provides a root filesystem with a variety of useful
ARM Linux utilities. Comments, questions, or bug fixes may be directed to the authors via email
at ss-sa@cs.colorado.edu.

In this release, the critical platform components have been implemented, including ARM
instruction emulation, ARM MMU support, and I/O models for the ARM iPAQ real-time clock,
interrupt controller, serial devices, FLASH and DRAM memory, and the OS timer. The simulator
model is able to boot the Linux kernel; however, some bootloader and Linux commands are still
not functioning due to a few remaining implementation issues. Nevertheless, a significant amount
of functionality exists in this release, so we have made it available and will update the code
as bugs are identified and fixed.

A.1  Sim-iPAQ  Distribution  Components

Platform Simulator - The platform simulator is derived from SimpleScalar version 3.0,
located at www.simplescalar.com. It includes ARM instruction emulation, ARM MMU support,
and I/O models for the ARM iPAQ real-time clock, interrupt controller, serial devices, FLASH
and DRAM memory, and the OS timer.

Platform Console - The platform console provides serial terminal emulation. It connects to
the platform simulator and allows the user to issue bootloader and Linux command-line
commands to the simulator.

ARM Bootloader - The ARM bootloader is installed into memory at simulator initialization
time. It decompresses the kernel and initializes the filesystem.

ARM Linux Kernel - The ARM Linux Kernel provides operating system functionality and
is decompressed by the bootloader at initialization.

ARM Linux Root Filesystem - The ARM Linux Root Filesystem provides a number of standard
utilities that are available after the ARM Linux Kernel boots on the platform simulator. A few
examples of these standard utilities are ls, diff, and mount.

A.2  Building  the  Simulation  Environment 

The following sections describe how to build the various components of the Sim-iPAQ
simulation environment. Note that this release of Sim-iPAQ has only been tested on RedHat Linux
version 8.0 for x86. However, it will probably work on any little-endian platform, provided that
the build uses GNU GCC for compilation.

A.3  iPAQ  Platform  Model 

The Sim-iPAQ platform model, located in the "sim-ipaq/" directory of the download
archive, must be configured before it can be built. In addition to the normal SimpleScalar
configuration parameters found in the README file, the variable LINUX_PATH in the Makefile must
be set to the root location of the Linux build. Once the normal configurations are made and the
Makefile is configured as above, the platform simulator, called sim-ipaq, is built with the
following command:

make 

The build process will also compile the Platform Console, conveniently called console.

A.4  iPAQ  Bootloader,  ARM  Linux  Kernel,  and  Root  Filesystem

Before beginning this step, note that pre-built versions of the iPAQ Bootloader, ARM Linux
Kernel, and Root Filesystem are provided in the Sim-iPAQ distribution, so you may skip this
step unless a custom kernel or filesystem is required.

If a custom environment is desired, the first thing needed is an ARM cross compiler to build
the bootloader and ARM Linux Kernel. One such cross compiler is available for SimpleScalar at
the following location: http://www.simplescalar.com/v4test.html.

The  iPAQ  Bootloader,  located  in  the  "linux-build/bootldr"  directory,  is  a  free  ARM-based 
bootloader  distributed  by  Compaq.  It  provides  a  variety  of  debug  functions,  plus  Linux  kernel 
decompression,  and  root  filesystem  initialization.  To  build  the  bootloader  execute  the  following 
command  in  the  bootloader  directory: 

make 

This  will  produce  the  file  "bootldr.bin",  which  is  an  ELF  binary  format  bootloader,  in  the 
format  expected  by  the  Sim-iPAQ  platform  simulator.  See  the  README  files  for  details  on  the 
commands  supported  by  the  bootloader.  Additional  documentation  is  available  by  executing  the 
"help"  command  at  the  bootloader  prompt. 

The ARM Linux Kernel, located in the directory "linux-build/linux", has a complete kernel
build. A large number of build options are available for the kernel; they are described in the
README file located in that directory. The kernel has been pre-configured with the options
expected by the Sim-iPAQ platform simulator. To build the ARM Linux Kernel, execute the
following command in the kernel directory:

make  zimage 

This will create "zImage.bin", which is a compressed ARM Linux Kernel with devices compiled
in to match the devices supported by the Sim-iPAQ platform simulator.

The ARM Linux Root Filesystem provides a minimal filesystem available to users once the
ARM Linux Kernel boots on the Sim-iPAQ platform simulator. The root filesystem is loaded into
simulated FLASH memory as a compressed-RAM filesystem (CRAMFS). The first step in building
a compressed-RAM filesystem is to build the filesystem build utility "mkcramfs", which is
located in the directory "linux/kernel/scripts/cramfs/". Build this utility with the following
command in the CRAMFS directory:

make 

Next, assemble a filesystem, on the local host filesystem, with exactly the same ARM binaries
and permissions desired on the CRAMFS. To create the CRAMFS filesystem, execute the following
command:

linux/kernel/scripts/cramfs/mkcramfs init-2-56 init-2-56.cramfs

Here "init-2-56" is the top-level directory of the local representation of the filesystem to
create, and "init-2-56.cramfs" is the name of the file that will contain the compressed
filesystem.

A.5  Running  the  iPAQ  Platform  Model

To run the iPAQ platform model, first run the platform simulator, located in the "sim-ipaq/"
directory, with the following command:

sim-ipaq  linux-boot 

The argument "linux-boot" indicates that the platform simulator should initiate a standard
Linux boot sequence. The standard boot sequence accesses files in the directory specified by the
build parameter LINUX_PATH. The sequence first reads the bootloader executable "bootldr.bin",
then the compressed ARM Linux Kernel "zImage.bin", and finally the root filesystem
"init-2-56.cramfs".

After reading the FLASH ROM components into simulated FLASH RAM, the platform simulator
will connect to the terminal emulator. The terminal emulator is the user's access point to the
Linux simulation, providing a means for entering command lines to the bootloader and Linux
shells. The platform console is a front-end to the serial device emulator. To start the platform
console, enter the following command in a separate window:

console -s script-boot.txt

This  will  initiate  a  platform  console  connection  to  the  running  sim-ipaq  simulator,  and  run 
an  initial  set  of  bootloader  commands,  listed  in  the  file  "script-boot.txt".  These  commands  are 
required  to  initialize  the  Linux  kernel  memory  and  CRAMFS  filesystem.  Once  the  commands 
complete,  the  Linux  kernel  will  boot,  after  which  the  user  can  enter  additional  commands  from 
the  platform  console  window. 